Overview

So far, we’ve been using the simple mean to make predictions. Today, we’ll continue using the simple mean, but in a more sophisticated way. Before, when we calculated conditional means, we did so within certain “groupings” of variables. With linear regression, we no longer need to do that. Instead, linear regression allows us to calculate the conditional mean of the outcome at every value of the predictor. If the predictor takes on just a few values, then that’s the number of conditional means that will be calculated. If the predictor is continuous and takes on a large number of values, we can still calculate the conditional mean at every one of those values.

The model we posit for regression is as follows:

\[Y=\beta_0+\beta_1 x_1 +\beta_2 x_2+ \dots +\beta_k x_k + \epsilon\]

It’s just a linear, additive model. \(Y\) increases or decreases as a function of the \(x\)’s, with multiple \(x\)’s included. \(\epsilon\) is the extent to which an individual value falls above or below the fitted line. Each \(\beta\) is the amount that \(Y\) is predicted to increase or decrease for a one-unit change in the corresponding \(x\), holding the other predictors constant.
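To make this concrete, here is a worked example with made-up coefficients (not estimates from any data): suppose \(\beta_0=2\), \(\beta_1=3\), and a single predictor with value \(x_1=4\). The predicted value is

\[\hat{Y}=\beta_0+\beta_1 x_1=2+3\times 4=14\]

and \(\epsilon\) is whatever gap remains between 14 and the value of \(Y\) we actually observe.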

We’ll use the tidyverse and tidymodels packages, along with plotly for interactive graphics.

library(tidyverse)
library(tidymodels)
library(plotly)

The Data

ad<-readRDS("area_data.Rds")
ad
## # A tibble: 926 x 13
## # Groups:   name [926]
##    name  college_educ perc_commute_30p perc_insured perc_homeown geoid income_75
##    <chr>        <dbl>            <dbl>        <dbl>        <dbl> <chr>     <dbl>
##  1 Big …         15.7             34.0         96.8         65.8 13720      20.9
##  2 Bill…         31.6             19.4         95.6         73.0 13740      39.5
##  3 Bing…         27.9             22.9         97.7         71.2 13780      35.6
##  4 Birm…         31.4             42.3         92.4         71.4 13820      38.6
##  5 Bism…         33.3             17.0         98.2         74.7 13900      47.8
##  6 Blac…         21.2             38.7         96.3         78.6 13940      33.9
##  7 Blac…         35.2             28.5         96.7         60.5 13980      33.6
##  8 Bloo…         44.8             19.2         98.7         68.5 14010      45.7
##  9 Bloo…         40.8             26.7         95.5         61.0 14020      32.7
## 10 Bloo…         24.9             30.8         98.9         72.2 14100      33.0
## # … with 916 more rows, and 6 more variables: perc_moved_in <dbl>,
## #   perc_in_labor_force <dbl>, metro <chr>, state <chr>, region <fct>,
## #   division <fct>

This data comes from the American Community Survey of 2019. It covers all of the metro and micro statistical areas in the United States. It includes characteristics of these areas, including education, income, home ownership, and others, as described below.

| Name | Description |
|---|---|
| name | Name of micro/metro area |
| college_educ | Percent of population with at least a bachelor’s degree |
| perc_commute_30p | Percent of population with commute to work of 30 minutes or more |
| perc_insured | Percent of population with health insurance |
| perc_homeown | Percent of housing units owned by occupier |
| geoid | Geographic FIPS code (id) |
| income_75 | Percent of population with income over $75,000 |
| perc_moved_in | Percent of population that moved from another state in last year |
| perc_in_labor_force | Percent of population in labor force |
| metro | Metropolitan area? Yes/No |
| state | State abbreviation |
| region | Census region |
| division | Census division |

Dependent Variable

Our dependent variable will be the percent of the population that has a yearly income over $75,000. Let’s take a look at this variable.

ad%>%
  ggplot(aes(x=income_75))+
  geom_density()

This variable has a nice symmetric distribution. It looks approximately normal, which will help in interpreting the results.
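As a rough numerical check of symmetry (a sketch; the exact values will depend on the data, and I haven’t reported them here), we can compare the mean and median, which should be close to each other for a roughly symmetric distribution:

ad%>%
  summarize(mean_income=mean(income_75), ## center of the distribution
            median_income=median(income_75), ## should be near the mean if symmetric
            sd_income=sd(income_75)) ## spread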

Testing and Training

The essence of prediction is discovering the extent to which our models can predict outcomes for data that does not come from our sample. Many times this process is temporal. We fit a model to data from one time period, then take predictors from a subsequent time period to come up with a prediction in the future. For instance, we might use data on team performance to predict the likely winners and losers for upcoming soccer games.

This process does not have to be temporal. Data can be out of sample because it hadn’t yet been collected when our original data was collected, or simply because we designated it as out of sample.

The data that is used to generate our predictions is known as training data. The idea is that this is the data used to train our model, to let it know what the relationship is between our predictors and our outcome. So far, we have only worked with training data.

That data that is used to validate our predictions is known as testing data. With testing data, we take our trained model and see how good it is at predicting outcomes using out of sample data.

One very simple approach to this would be to cut our data in half. We could then train our model on half the data, then test it on the other half. This would tell us whether our measure of model fit (e.g. rmse) is similar or different when we apply our model to out of sample data. That’s what we’re going to do in this lesson. We’ll split the data randomly in half, with one half used to train our models and the other half used to test the model against out of sample data.

I use the initial_split command to designate how the data should be split: in this case, in half (prop = .5). The testing data (a random half of the original dataset) is stored as ad_test. The training data is stored as ad_train. We’ll run our models on ad_train.

Note: the set.seed command ensures that your random split should be the same as my random split.

Training and Testing Data

set.seed(35202)

split_data<-ad%>%initial_split(prop=.5)

ad_train<-training(split_data)

ad_test<-testing(split_data)

Visualizing the Relationship

ad%>%
  ggplot(aes(x=college_educ,y=income_75))+
  geom_point()+
  geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

It’s a good idea to plot the data before running a regression. The scatterplot above shows the percent of the population with incomes above $75,000 on the y axis and the percent of the population with at least a bachelor’s degree on the x axis. I always like to be able to identify the individual units in datasets like this. To do that, I use the ggplotly command below.

An Interactive Approach

gg<-ad%>%
  ggplot(aes(x=college_educ,y=income_75,label=name))+
  geom_point()+
  geom_smooth(method="lm")

  ggplotly(gg)
## `geom_smooth()` using formula 'y ~ x'

The ggplotly command makes the plot interactive, so we can identify the individual units. Try it out to see which areas have higher or lower incomes and higher or lower percentages of the population with a bachelor’s degree.

Set Formula

To start training our model, we need to specify a formula. The dependent variable always goes on the left hand side, while the independent variable or variables always go on the right hand side. The formula below is for a model with the percent of the population with incomes above 75,000 on the left hand side, and the percent of the population with a college education on the right hand side.

income_formula<-as.formula("income_75~college_educ")

Define Model

The next step is to define the model. The type of model I want is a linear regression, which I specify below. This is also sometimes called “ordinary least squares” regression.

lm_fit <- 
  linear_reg() %>% 
  set_engine("lm")%>%
  set_mode("regression")

Fit Model to the Data

Now we’re ready to actually fit the model to the data. In the fit command below, I specify the formula to use and the dataset on which to fit the linear regression specified above.

lm_results<-
  lm_fit%>%
  fit(income_formula,data=ad_train)

Look at the Results

To examine the results, we can use the summary command.

summary(lm_results$fit)
## 
## Call:
## stats::lm(formula = income_75 ~ college_educ, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.542  -3.808  -0.257   3.393  24.482 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.59764    0.81360   21.63   <2e-16 ***
## college_educ  0.69211    0.03178   21.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.193 on 461 degrees of freedom
## Multiple R-squared:  0.5071, Adjusted R-squared:  0.506 
## F-statistic: 474.2 on 1 and 461 DF,  p-value: < 2.2e-16

What this tells us is that for each additional percent of the population with a bachelor’s degree, the percent of the population with incomes above $75,000 is predicted to increase by 0.69 percentage points.
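For example, for an area where 30 percent of the population holds a bachelor’s degree (a hypothetical value, chosen just for illustration), the fitted model predicts

\[\widehat{income\_75}=17.6+0.69\times 30\approx 38.3\]

that is, about 38 percent of the population with incomes over $75,000.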

Predict using the testing dataset

ad_test<-lm_results%>% ## start with the results
  predict(new_data=ad_test)%>% ## create a prediction
  rename(pred1=.pred)%>%  ## rename the prediction
  bind_cols(ad_test) ## add the prediction to the testing dataset

Now we have the prediction in our testing dataset, named pred1 using the rename command. We can compare the actual value with the predicted value in the testing data using rmse.

Calculate RMSE

rmse_1<-rmse(ad_test,
     truth=income_75,
     estimate=pred1)

rmse_1
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.00

The RMSE of 6 indicates how close our predictions are to the actual values in the testing dataset. This is important because it gives a sense of how the model we trained performs when making predictions on new (out-of-sample) data.
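For reference, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values:

\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}\]

Because of the square root, RMSE is expressed in the same units as the outcome, so an RMSE of 6 means our predictions for a typical area miss the actual percentage by about 6 percentage points.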

Quick Exercise: Run a regression using a different predictor. Calculate RMSE and see if you can beat my score.

Multiple regression

We can continue refining our model by adding more variables. Regression with more than one predictor is called multiple regression. The nice thing is that the steps will all be the same, we just need to change the formula and the rest stays pretty much as it was.

Set Formula

income_formula<-as.formula("income_75~college_educ+perc_homeown")

Define model

lm_fit <- 
  linear_reg() %>% 
  set_engine("lm")%>%
  set_mode("regression")

Fit Model to the Data

lm_results<-
  lm_fit%>%
  fit(income_formula,data=ad_train)

Look at the results

summary(lm_results$fit)
## 
## Call:
## stats::lm(formula = income_75 ~ college_educ + perc_homeown, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5426  -4.0456  -0.3423   3.0812  26.5276 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.13623    3.28010   1.566 0.118066    
## college_educ  0.70876    0.03159  22.439  < 2e-16 ***
## perc_homeown  0.17422    0.04447   3.918 0.000103 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.099 on 460 degrees of freedom
## Multiple R-squared:  0.523,  Adjusted R-squared:  0.5209 
## F-statistic: 252.2 on 2 and 460 DF,  p-value: < 2.2e-16

Predict using the testing dataset

ad_test<-lm_results%>%
  predict(new_data=ad_test)%>%
  rename(pred2=.pred)%>%
  bind_cols(ad_test)

Calculate RMSE

rmse_2<-rmse(ad_test,
     truth=income_75,
     estimate=pred2)

rmse_2
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.70

The RMSE of 5.7 is lower than the 6.0 we got with a single predictor, showing that we get a more accurate prediction in the testing dataset using two predictors.

Quick Exercise: Run a regression using two different predictors. Calculate RMSE and see if you can beat my score.

You MUST remember: correlation is not causation. All this tool can pick up on is associations, or common patterns. It can’t tell you whether one thing causes another. Remember that the left-hand-side variable could just as easily be on the right-hand side.